Video action segmentation aims to partition a video into several action segments. Recently, timestamp supervision has received much attention due to its lower annotation cost. We find that frames near the boundaries of action segments lie in the transition region between two consecutive actions and have unclear semantics; we call these regions ambiguous intervals. Most existing methods iteratively generate pseudo-labels for all frames in each video to train the segmentation model. However, ambiguous intervals are more likely to be assigned noisy and incorrect pseudo-labels, which leads to performance degradation. We propose a novel framework to train the model under timestamp supervision, comprising the following two parts. First, pseudo-label ensembling generates pseudo-label sequences with ambiguous intervals, in which frames have no pseudo-labels. Second, iterative clustering propagates pseudo-labels into the ambiguous intervals by clustering, and thus updates the pseudo-label sequences used to train the model. We further introduce a clustering loss, which encourages the features of frames within the same action segment to be more compact. Extensive experiments show the effectiveness of our method.
translated by Google Translate
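The clustering loss above is only described at a high level. A minimal sketch of one plausible form, a per-segment compactness penalty that pulls frame features toward their segment centroid, follows; the exact formulation is an assumption, not the paper's loss:

```python
import numpy as np

def compactness_loss(features, segment_ids):
    """Mean squared distance of each frame feature to the centroid of its
    action segment.  Frames in ambiguous intervals (segment id = -1) carry
    no pseudo-label and are skipped, so the penalty never acts on frames
    with unclear semantics."""
    loss, count = 0.0, 0
    for seg in np.unique(segment_ids):
        if seg < 0:                                   # ambiguous interval
            continue
        seg_feats = features[segment_ids == seg]      # (n_seg, d)
        centroid = seg_feats.mean(axis=0)
        loss += ((seg_feats - centroid) ** 2).sum()
        count += len(seg_feats)
    return loss / max(count, 1)
```

The loss vanishes exactly when all frames of a segment share one feature vector, which is the compactness the abstract asks for.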
Popular graph neural network models have made significant progress in graph representation learning. However, in this paper we identify a consistently overlooked phenomenon: pre-trained graph representation learning models tested on full graphs underperform the same models tested on well-pruned graphs. This observation reveals that there exist confounders in graphs, which may interfere with the model's learning of semantic information, and whose influence current graph representation learning methods do not eliminate. To tackle this issue, we propose Robust Causal Graph Representation Learning (RCGRL) to learn robust graph representations against confounding effects. RCGRL introduces an active approach to generate instrumental variables under unconditional moment restrictions, which empowers the graph representation learning model to eliminate confounders, thereby capturing discriminative information that is causally related to downstream predictions. We offer theorems and proofs to guarantee the theoretical effectiveness of the proposed approach. Empirically, we conduct extensive experiments on a synthetic dataset and multiple benchmark datasets. The results demonstrate that, compared with state-of-the-art methods, RCGRL achieves better prediction performance and generalization ability.
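The abstract does not spell out how the instrumental variables are generated; the unconditional moment restriction they must satisfy, though, is the classical instrumental-variable condition. A generic illustration of checking that condition (not RCGRL's generation procedure), where a valid instrument Z is uncorrelated with the model's residuals:

```python
import numpy as np

def moment_violation(Z, residuals):
    """Unconditional moment restriction at the heart of instrumental-variable
    estimation: a valid instrument Z satisfies E[Z * residual] = 0.  This
    measures how far a candidate instrument is from that restriction,
    as the largest absolute sample moment across instrument dimensions."""
    return float(np.abs((Z * residuals).mean(axis=0)).max())
```

A violation near zero is necessary (not sufficient) for the instrument to be valid; RCGRL instead actively generates instruments so that the restriction holds.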
Search engines based on keyword retrieval, which return keyword-related web pages, no longer fit the way information is acquired in the era of the intelligent Internet. How to obtain the information users need from large-scale Internet data quickly, accurately, and effectively has become one of the key problems that urgently need to be solved. We propose an intelligent question-answering system based on a structured knowledge base (KB) and unstructured data, called OpenQA, in which users can pose a question and the model quickly returns an accurate answer. We integrate structured KB question answering based on semantic parsing and deep representation learning with two-stage unstructured question answering based on retrieval and neural machine reading comprehension, and return the final answer with the highest probability through a Transformer-based answer-selection module in OpenQA. We conduct preliminary experiments on a dataset we constructed, and the experimental results demonstrate the effectiveness of the proposed intelligent question-answering system. Meanwhile, the core technology of each module of the OpenQA platform remains at the forefront of academic research, and we further explore the theoretical essence and enrichment of OpenQA based on these research hotspots.
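Schematically, the final answer-selection step merges candidates from the two branches and keeps the most probable one. A toy sketch, assuming each branch returns (answer, probability) pairs; in the real system a Transformer re-ranker produces these scores, and a plain max stands in for it here:

```python
def select_answer(kbqa_candidates, mrc_candidates):
    """Merge candidate answers from the structured (KBQA) branch and the
    unstructured (retrieval + machine reading comprehension) branch, and
    return the (answer, probability) pair with the highest probability."""
    candidates = list(kbqa_candidates) + list(mrc_candidates)
    if not candidates:
        return None, 0.0
    return max(candidates, key=lambda pair: pair[1])
```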
Although vision transformers achieve excellent performance as backbone models on many vision tasks, most of them intend to capture the global relations of all tokens in an image or a window, which disrupts the inherent spatial and local correlations between patches in the 2D structure. In this paper, we introduce a simple vision transformer named SimViT, which incorporates spatial structure and local information into the vision transformer. Specifically, we introduce Multi-head Central Self-Attention (MCSA) instead of conventional Multi-head Self-Attention to capture highly local relations. The introduction of sliding windows facilitates the capture of spatial structure. Meanwhile, SimViT extracts multi-scale hierarchical features from different layers for dense prediction tasks. Extensive experiments show that SimViT is effective and efficient as a general-purpose backbone model for various image processing tasks. In particular, our SimViT-Micro needs only 3.3M parameters to achieve 71.1% top-1 accuracy on the ImageNet-1K dataset, which is by far the smallest vision transformer model. Our code will be available at https://github.com/cucasligang/simvit.
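To make the idea of attention restricted to a local, centred neighbourhood concrete, here is a heavily simplified single-head sketch over a 1-D token sequence. The paper's MCSA operates on 2-D patches with multiple heads and learned projections; this illustration assumes none of that and only shows the windowing:

```python
import numpy as np

def central_self_attention(x, window=3):
    """Each token attends only to tokens inside a window centred on it,
    so attention stays highly local instead of global.  x has shape (n, d);
    scores use plain scaled dot-product without learned Q/K/V projections."""
    n, d = x.shape
    half = window // 2
    out = np.zeros_like(x)
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        keys = x[lo:hi]                             # local neighbourhood only
        scores = keys @ x[i] / np.sqrt(d)           # scaled dot-product
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[i] = weights @ keys
    return out
```

Sliding the window with the token index is what preserves the spatial structure that global attention discards.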
Multivariate time series forecasting is a challenging task, because the data involves a mixture of long- and short-term patterns with dynamic spatio-temporal dependencies among variables. Existing graph neural networks (GNNs) typically model multivariate relationships with a pre-defined spatial graph or a learned fixed adjacency graph. This limits the applicability of GNNs and leaves the above challenges unaddressed. In this paper, we propose a novel framework, the Static and Dynamic Graph Learning Neural Network (SDGL). The model acquires static and dynamic graph matrices from the data to model long- and short-term patterns, respectively. The static matrix is developed to capture fixed long-term association patterns via node embeddings, and graph regularity is leveraged to control the quality of the learned static graph. To capture dynamic dependencies among variables, we propose a dynamic graph method that generates time-varying matrices based on changing node features and the static node embeddings. In this method, we integrate the learned static graph information as an inductive bias to better construct the dynamic graphs and local spatio-temporal patterns. Extensive experiments are conducted on two traffic datasets with additional structural information and on four time series datasets, showing that our method achieves state-of-the-art performance on almost all datasets. If the paper is accepted, I will open-source the code on GitHub.
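The abstract does not give the exact static-graph construction. A common pattern in the learned-graph literature, shown here purely as an assumed illustration of "adjacency from node embeddings", derives a row-normalized adjacency matrix from two embedding tables:

```python
import numpy as np

def static_adjacency(E1, E2):
    """Derive a static adjacency matrix from two learned node-embedding
    tables (an assumed, generic construction, not SDGL's exact one):
    A = row-wise softmax(ReLU(E1 @ E2.T))."""
    logits = np.maximum(E1 @ E2.T, 0.0)            # ReLU keeps affinities non-negative
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)
```

Because the embeddings are trained parameters rather than time-dependent features, the resulting graph is fixed across time steps, which is why it captures long-term association patterns.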
Deep hashing has shown promising performance in large-scale image retrieval. However, latent codes extracted by a deep neural network (DNN) will inevitably lose semantic information during the binarization process, which damages retrieval efficiency and makes the task challenging. Although many existing approaches apply regularization to alleviate quantization errors, we identify an incompatible conflict between the metric loss and the quantization loss. The metric loss penalizes inter-class distances to push different classes unconstrainedly far apart. Worse still, it tends to map latent codes away from the ideal binarization point, generating severe ambiguity in the binarization process. Based on the minimum distance of binary linear codes, a hashing-guided hinge function (HHF) is proposed to avoid this conflict. In detail, we carefully design a specific inflection point, which depends on the hash bit length and the number of categories, to balance metric learning and quantization learning. This modification prevents the network from falling into local metric optima in deep hashing. Extensive experiments on CIFAR-10, CIFAR-100, ImageNet, and MS-COCO show that HHF consistently outperforms existing methods, and that it is robust and flexible to transplant into other methods.
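The key idea, a hinge that stops pushing dissimilar pairs apart once they are separated enough, can be sketched per pair. This is an illustrative simplification: the real HHF derives its inflection point from the hash bit length and class count, whereas here it is left as a free parameter:

```python
def hhf_pair_loss(cos_sim, similar, inflection=0.5):
    """Hinge-style pairwise loss sketch.  Similar pairs are pulled toward
    cosine similarity 1; dissimilar pairs are penalised only while their
    similarity exceeds the inflection point, so they are never pushed
    unconstrainedly far apart (the conflict the abstract describes)."""
    if similar:
        return 1.0 - cos_sim
    return max(0.0, cos_sim - inflection)
```

Once a dissimilar pair drops below the inflection point its gradient vanishes, leaving the quantization loss free to pull codes toward the binarization points.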
Deep networks for computer vision are not reliable when they encounter adversarial examples. In this paper, we introduce a framework that uses the dense intrinsic constraints in natural images to robustify inference. By introducing constraints at inference time, we can shift the burden of robustness from training to the inference algorithm, thereby allowing the model to adjust dynamically to each individual image's unique and potentially novel characteristics at inference time. Among different constraints, we find that equivariance-based constraints are most effective, because they allow dense constraints in the feature space without overly constraining the representation at a fine-grained level. Our theoretical results validate the importance of having such dense constraints at inference time. Our empirical experiments show that restoring feature equivariance at inference time defends against worst-case adversarial perturbations. The method obtains improved adversarial robustness on four datasets (ImageNet, Cityscapes, PASCAL VOC, and MS-COCO) on image recognition, semantic segmentation, and instance segmentation tasks. Project page is available at equi4robust.cs.columbia.edu.
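The equivariance constraint above can be written as a residual: features of a transformed input should equal the transformed features, f(T(x)) ≈ T(f(x)). A minimal sketch that only computes this residual (the full method additionally optimises the input or activations at inference time to reduce it, which is not shown):

```python
import numpy as np

def equivariance_loss(feature_fn, x, transform):
    """Mean squared equivariance residual ||f(T(x)) - T(f(x))||^2 / n for a
    given feature extractor and transform.  A perfectly equivariant extractor
    yields 0; adversarial perturbations tend to break the equality, which is
    what restoring equivariance at inference time exploits."""
    return float(((feature_fn(transform(x)) - transform(feature_fn(x))) ** 2).mean())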
Spatial-temporal (ST) graph modeling, such as traffic speed forecasting and taxi demand prediction, is an important task in deep learning area. However, for the nodes in graph, their ST patterns can vary greatly in difficulties for modeling, owning to the heterogeneous nature of ST data. We argue that unveiling the nodes to the model in a meaningful order, from easy to complex, can provide performance improvements over traditional training procedure. The idea has its root in Curriculum Learning which suggests in the early stage of training models can be sensitive to noise and difficult samples. In this paper, we propose ST-Curriculum Dropout, a novel and easy-to-implement strategy for spatial-temporal graph modeling. Specifically, we evaluate the learning difficulty of each node in high-level feature space and drop those difficult ones out to ensure the model only needs to handle fundamental ST relations at the beginning, before gradually moving to hard ones. Our strategy can be applied to any canonical deep learning architecture without extra trainable parameters, and extensive experiments on a wide range of datasets are conducted to illustrate that, by controlling the difficulty level of ST relations as the training progresses, the model is able to capture better representation of the data and thus yields better generalization.
translated by 谷歌翻译
Deep learning-based full-reference image quality assessment (FR-IQA) models typically rely on the feature distance between the reference and distorted images. However, the underlying assumption of these models that the distance in the deep feature domain could quantify the quality degradation does not scientifically align with the invariant texture perception, especially when the images are generated artificially by neural networks. In this paper, we bring a radical shift in inferring the quality with learned features and propose the Deep Image Dependency (DID) based FR-IQA model. The feature dependency facilitates the comparisons of deep learning features in a high-order manner with Brownian distance covariance, which is characterized by the joint distribution of the features from reference and test images, as well as their marginal distributions. This enables the quantification of the feature dependency against nonlinear transformation, which is far beyond the computation of the numerical errors in the feature space. Experiments on image quality prediction, texture image similarity, and geometric invariance validate the superior performance of our proposed measure.
translated by 谷歌翻译
凝视估计对于许多科学领域和日常应用至关重要,范围从认知心理学的基本研究到注意力吸引人的移动系统。尽管深度学习的最新进展在建立高度准确的凝视估计系统方面取得了巨大的成功,但相关的高计算成本以及对大规模标记的凝视数据的依赖,以实现对现有解决方案实际使用的监督学习地点挑战。为了超越这些局限性,我们提出了FreeGaze,这是一种用于无监督的注视表示学习的资源有效框架。 FreeGaze在其设计中结合了频域目光的估计和对比度注视表示。前者大大减轻了系统校准和凝视估计中的计算负担,并大大减少了系统延迟。尽管后者克服了现有基于学习的同行的数据标记障碍,并确保在没有凝视标签的情况下确保有效的凝视表示学习。我们对两个凝视估计数据集的评估表明,通过现有基于监督的学习方法,FreeGaze可以在系统校准和注视估计中分别实现高达6.81和1.67倍的速度,以实现可比较的凝视估计精度。
translated by 谷歌翻译